density peak
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > Singapore (0.04)
- Asia > Indonesia > Bali (0.04)
- (15 more...)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.67)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)
- North America > United States > Kansas (0.04)
- North America > Canada (0.04)
An unsupervised tour through the hidden pathways of deep neural networks
The goal of this thesis is to improve our understanding of the internal mechanisms by which deep artificial neural networks create meaningful representations and are able to generalize. We focus on the challenge of characterizing the semantic content of the hidden representations with unsupervised learning tools, partially developed by us and described in this thesis, which allow harnessing the low-dimensional structure of the data. Chapter 2. introduces Gride, a method that allows estimating the intrinsic dimension of the data as an explicit function of the scale without performing any decimation of the data set. Our approach is based on rigorous distributional results that enable the quantification of uncertainty of the estimates. Moreover, our method is simple and computationally efficient since it relies only on the distances among nearest data points. In Chapter 3, we study the evolution of the probability density across the hidden layers in some state-of-the-art deep neural networks. We find that the initial layers generate a unimodal probability density getting rid of any structure irrelevant to classification. In subsequent layers, density peaks arise in a hierarchical fashion that mirrors the semantic hierarchy of the concepts. This process leaves a footprint in the probability density of the output layer, where the topography of the peaks allows reconstructing the semantic relationships of the categories. In Chapter 4, we study the problem of generalization in deep neural networks: adding parameters to a network that interpolates its training data will typically improve its generalization performance, at odds with the classical bias-variance trade-off. We show that wide neural networks learn redundant representations instead of overfitting to spurious correlation and that redundant neurons appear only if the network is regularized and the training error is zero.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > Kansas (0.04)
- North America > United States > Florida > Broward County > Fort Lauderdale (0.04)
- (4 more...)
- Health & Medicine > Pharmaceuticals & Biotechnology (0.92)
- Health & Medicine > Therapeutic Area > Neurology (0.67)
- Education (0.63)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > Singapore (0.04)
- Asia > Indonesia > Bali (0.04)
- (15 more...)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.67)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)
- North America > United States > Kansas (0.04)
- North America > Canada (0.04)
GBSK: Skeleton Clustering via Granular-ball Computing and Multi-Sampling for Large-Scale Data
Chen, Yewang, Li, Junfeng, Xia, Shuyin, Lai, Qinghong, Gao, Xinbo, Wang, Guoyin, Cheng, Dongdong, Liu, Yi, Wang, Yi
To effectively handle clustering task for large-scale datasets, we propose a novel scalable skeleton clustering algorithm, namely GBSK, which leverages the granular-ball technique to capture the underlying structure of data. By multi-sampling the dataset and constructing multi-grained granular-balls, GBSK progressively uncovers a statistical "skeleton" -- a spatial abstraction that approximates the essential structure and distribution of the original data. This strategy enables GBSK to dramatically reduce computational overhead while maintaining high clustering accuracy. In addition, we introduce an adaptive version, AGBSK, with simplified parameter settings to enhance usability and facilitate deployment in real-world scenarios. Extensive experiments conducted on standard computing hardware demonstrate that GBSK achieves high efficiency and strong clustering performance on large-scale datasets, including one with up to 100 million instances across 256 dimensions. Our implementation and experimental results are available at: https://github.com/XFastDataLab/GBSK/.
- Asia > China > Chongqing Province > Chongqing (0.05)
- Asia > China > Fujian Province > Xiamen (0.04)
LGBQPC: Local Granular-Ball Quality Peaks Clustering
Jia, Zihang, Zhang, Zhen, Pedrycz, Witold
The density peaks clustering (DPC) algorithm has attracted considerable attention for its ability to detect arbitrarily shaped clusters based on a simple yet effective assumption. Recent advancements integrating granular-ball (GB) computing with DPC have led to the GB-based DPC (GBDPC) algorithm, which improves computational efficiency. However, GBDPC demonstrates limitations when handling complex clustering tasks, particularly those involving data with complex manifold structures or non-uniform density distributions. To overcome these challenges, this paper proposes the local GB quality peaks clustering (LGBQPC) algorithm, which offers comprehensive improvements to GBDPC in both GB generation and clustering processes based on the principle of justifiable granularity (POJG). Firstly, an improved GB generation method, termed GB-POJG+, is developed, which systematically refines the original GB-POJG in four key aspects: the objective function, termination criterion for GB division, definition of abnormal GB, and granularity level adaptation strategy. GB-POJG+ simplifies parameter configuration by requiring only a single penalty coefficient and ensures high-quality GB generation while maintaining the number of generated GBs within an acceptable range. In the clustering phase, two key innovations are introduced based on the GB k-nearest neighbor graph: relative GB quality for density estimation and geodesic distance for GB distance metric. These modifications substantially improve the performance of GBDPC on datasets with complex manifold structures or non-uniform density distributions. Extensive numerical experiments on 40 benchmark datasets, including both synthetic and publicly available datasets, validate the superior performance of the proposed LGBQPC algorithm.
- Asia > Macao (0.04)
- Asia > China > Liaoning Province > Dalian (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- (6 more...)
How to optimize K-means?
Center-based clustering algorithms (e.g., K-means) are popular for clustering tasks, but they usually struggle to achieve high accuracy on complex datasets. We believe the main reason is that traditional center-based clustering algorithms identify only one clustering center in each cluster. Once the distribution of the dataset is complex, a single clustering center cannot strongly represent distant objects within the cluster. How to optimize the existing center-based clustering algorithms will be valuable research. In this paper, we propose a general optimization method called ECAC, and it can optimize different center-based clustering algorithms. ECAC is independent of the clustering principle and is embedded as a component between the center process and the category assignment process of center-based clustering algorithms. Specifically, ECAC identifies several extended-centers for each clustering center. The extended-centers will act as relays to expand the representative capability of the clustering center in the complex cluster, thus improving the accuracy of center-based clustering algorithms. We conducted numerous experiments to verify the robustness and effectiveness of ECAC. ECAC is robust to diverse datasets and diverse clustering centers. After ECAC optimization, the accuracy (NMI as well as RI) of center-based clustering algorithms improves by an average of 33.4% and 64.1%, respectively, and even K-means accurately identifies complex-shaped clusters.
MNIST-Nd: a set of naturalistic datasets to benchmark clustering across dimensions
Turishcheva, Polina, Hansel, Laura, Ritzert, Martin, Weis, Marissa A., Ecker, Alexander S.
Driven by advances in recording technology, large-scale high-dimensional datasets have emerged across many scientific disciplines. Especially in biology, clustering is often used to gain insights into the structure of such datasets, for instance to understand the organization of different cell types. However, clustering is known to scale poorly to high dimensions, even though the exact impact of dimensionality is unclear as current benchmark datasets are mostly two-dimensional. Here we propose MNIST-Nd, a set of synthetic datasets that share a key property of real-world datasets, namely that individual samples are noisy and clusters do not perfectly separate. MNIST-Nd is obtained by training mixture variational autoencoders with 2 to 64 latent dimensions on MNIST, resulting in six datasets with comparable structure but varying dimensionality. It thus offers the chance to disentangle the impact of dimensionality on clustering. Preliminary common clustering algorithm benchmarks on MNIST-Nd suggest that Leiden is the most robust for growing dimensions.
- Europe > Netherlands > South Holland > Leiden (0.26)
- Europe > Germany > Lower Saxony > Gottingen (0.15)
The representation landscape of few-shot learning and fine-tuning in large language models
Doimo, Diego, Serra, Alessandro, Ansuini, Alessio, Cazzaniga, Alberto
In-context learning (ICL) and supervised fine-tuning (SFT) are two common strategies for improving the performance of modern large language models (LLMs) on specific tasks. Despite their different natures, these strategies often lead to comparable performance gains. However, little is known about whether they induce similar representations inside LLMs. We approach this problem by analyzing the probability landscape of their hidden representations in the two cases. More specifically, we compare how LLMs solve the same question-answering task, finding that ICL and SFT create very different internal structures, in both cases undergoing a sharp transition in the middle of the network. In the first half of the network, ICL shapes interpretable representations hierarchically organized according to their semantic content. In contrast, the probability landscape obtained with SFT is fuzzier and semantically mixed. In the second half of the model, the fine-tuned representations develop probability modes that better encode the identity of answers, while the landscape of ICL representations is characterized by less defined peaks. Our approach reveals the diverse computational strategies developed inside LLMs to solve the same task across different conditions, allowing us to make a step towards designing optimal methods to extract information from language models.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > Singapore (0.04)
- Asia > Indonesia > Bali (0.04)
- (16 more...)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)